Journal of Machine Learning Research 12 (2011) 2857-2878 Submitted 4/11; Revised 9/11; Published 10/11 


Efficient Learning with Partially Observed Attributes* 


Nicolò Cesa-Bianchi NICOLO.CESA-BIANCHI@UNIMI.IT
DSI, Università degli Studi di Milano

via Comelico, 39 

20135 Milano, Italy 


Shai Shalev-Shwartz SHAIS@CS.HUJI.AC.IL
The Hebrew University 
Givat Ram, Jerusalem 91904, Israel 


Ohad Shamir OHADSH@MICROSOFT.COM
Microsoft Research 


One Memorial Drive 
Cambridge, MA 02142, USA 


Editor: Russ Greiner 


Abstract 


We investigate three variants of budgeted learning, a setting in which the learner is allowed to 
access a limited number of attributes from training or test examples. In the “local budget” setting, 
where a constraint is imposed on the number of available attributes per training example, we design 
and analyze an efficient algorithm for learning linear predictors that actively samples the attributes 
of each training instance. Our analysis bounds the number of additional examples sufficient to 
compensate for the lack of full information on the training set. This result is complemented by a 
general lower bound for the easier “global budget” setting, where it is only the overall number of 
accessible training attributes that is being constrained. In the third, “prediction on a budget” setting, 
when the constraint is on the number of available attributes per test example, we show that there 
are cases in which there exists a linear predictor with zero error but it is statistically impossible 
to achieve arbitrary accuracy without full information on test examples. Finally, we run simple 
experiments on a digit recognition problem that reveal that our algorithm has a good performance 
against both partial information and full information baselines. 


Keywords: budgeted learning, statistical learning, linear predictors, learning with partial information,
learning theory


1. Introduction 


Consider the problem of predicting whether a person has some disease based on medical tests. 
In principle, we may draw a sample of the population, perform a large number of medical tests 
on each person in the sample, and use this information to train a classifier. In many situations, 
however, this approach is unrealistic. First, patients participating in the experiment are generally 
not willing to go through a large number of medical tests. Second, each test has some associated 
cost, and we typically have a budget on the amount of money to spend for collecting the training 
information. This scenario, where there is a hard constraint on the number of training attributes the 
learner has access to, is known as budgeted learning.¹ Note that the constraint on the number of
training attributes may be local (no single participant is willing to undergo many tests) or global (the 
overall number of tests that can be performed is limited). In a different but related budgeted learning 
setting, the system may be facing a restriction on the number of attributes that can be viewed at test 
time. This may happen, for example, in a search engine, where a ranking of web pages must be 
generated for each incoming user query and there is no time to evaluate a large number of attributes 
to answer the query. 


We may thus distinguish three basic budgeted learning settings: 


• Local Budget Constraint: The learner has access to at most k attributes of each individual
example, where k is a parameter of the problem. The learner has the freedom to actively
choose which of the attributes are revealed, as long as at most k of them are given.


• Global Budget Constraint: The total number of training attributes the learner is allowed to
see is bounded by k. As in the local budget constraint setting, the learner has the freedom to
actively choose which of the attributes are revealed. In contrast to the local budget constraint
setting, the learner can choose to access more than k/m attributes from specific examples
(where m is the overall number of examples) as long as the global number of attributes is
bounded by k.


• Prediction on a budget: The learner receives the entire training set; however, at test time,
the predictor can see at most k attributes of each instance and then must form a prediction.
The predictor is allowed to actively choose which of the attributes are revealed.


In this paper we focus on budgeted linear regression, and prove negative and positive learning
results in the three abovementioned settings. Our first result shows that, under a global budget
constraint, no algorithm can learn a general d-dimensional linear predictor while observing fewer
than $\Omega(d)$ attributes at training time. This is complemented by the following positive result: we
show an efficient algorithm for learning under a given local budget constraint of 2k attributes per
example, for any $k \ge 1$. The algorithm actively picks which attributes to observe in each example
in a randomized way depending on past observed attributes, and constructs a “noisy” version of all
attributes. Intuitively, we can still learn despite the error of this estimate because instead of receiving 
the exact value of each individual example in a small set it suffices to get noisy estimations of many 
examples. We show that the overall number of attributes our algorithm needs to learn a regressor is at 
most a factor of d bigger than that used by standard regression algorithms that view all the attributes 
of each example. Ignoring logarithmic factors, the same gap of d exists when the attribute bound 
of our algorithm is specialized to the choice of parameters that is used to prove the abovementioned
$\Omega(d)$ lower bound under the global budget constraint.

In the prediction on a budget setting, we prove that in general it is not possible (even with an 
infinite amount of training examples) to build an active classifier that uses at most two attributes of 
each example at test time, and whose error will be smaller than a constant. This is in contrast with
the local budget setting, where it is possible to learn a consistent predictor by accessing at most two 
attributes of each example at training time. 





1. See, for example, webdocs.cs.ualberta.ca/~greiner/BudgetedLearning/. 
2. Related Work


The notion of budgeted learning is typically identified with the “global budget” and “prediction 
on a budget” settings—see, for example, Deng et al. (2007), Kapoor and Greiner (2005a,b) and 
Greiner et al. (2002) and references therein. The more restrictive “local budget” setting was
first proposed in Ben-David and Dichterman (1998) under the name of “learning with restricted
focus of attention”. Ben-David and Dichterman (1998) considered binary classification and showed
learnability of several hypothesis classes in this model, like k-DNF and axis-aligned rectangles.
However, to the best of our knowledge, no efficient algorithm for the class of linear predictors has
been proposed so far.²

Our algorithm for the local budget setting actively chooses which attributes to observe for each 
example. Similarly to the heuristics of Deng et al. (2007), we borrow ideas from the adversarial 
multi-armed bandit problem (Auer et al., 2003; Cesa-Bianchi and Lugosi, 2006). However, our 
algorithm is guaranteed to be attribute efficient, comes with finite sample generalization bounds, 
and is provably competitive with algorithms which enjoy full access to the data. A related but 
different setting is multi-armed bandit on a global budget—see, for example, Guha and Munagala 
(2007) and Madani et al. (2004). There one learns the single best arm rather than the best linear 
combination of many attributes, as we do here. Similar protocols were also studied in the context 
of active learning (Cohn et al., 1994; Balcan et al., 2006; Hanneke, 2007, 2009; Beygelzimer et al., 
2009), where the learner can ask for the target associated with specific examples. 

Finally, our technique is reminiscent of methods used in the compressed learning framework 
(Calderbank et al., 2009; Zhou et al., 2009), where data is accessed via a small set of random linear 
measurements. Unlike compressed learning, where learners are both trained and evaluated in the 
compressed domain, our techniques are mainly designed for a scenario in which only the access to 
training data is restricted. 

We note that a recent follow-up work (Hazan and Koren, 2011) presents 1-norm and 2-norm
based algorithms for our local budget setting, whose theoretical guarantees improve on those
presented in this paper, and match our lower bound to within logarithmic factors.


3. Linear Regression 


We consider linear regression problems where each example is an instance-target pair, $(x,y) \in \mathbb{R}^d \times \mathbb{R}$.
We refer to $x$ as a vector of attributes. Throughout the paper we assume that $\|x\|_\infty \le 1$ and
$|y| \le B$. The goal of the learner is to find a linear predictor $x \mapsto \langle w,x \rangle$. In the rest of the paper,
we use the term predictor to denote the vector $w \in \mathbb{R}^d$. The performance of a predictor $w$ on an
instance-target pair, $(x,y) \in \mathbb{R}^d \times \mathbb{R}$, is measured by a loss function $\ell(\langle w,x \rangle, y)$. For simplicity, we
focus on the squared loss function, $\ell(a,b) = (a-b)^2$, and briefly mention other loss functions in
Section 8. Following the standard framework of statistical learning (Haussler, 1992; Devroye et al.,
1996; Vapnik, 1998), we model the environment as a joint distribution $D$ over the set of
instance-target pairs, $\mathbb{R}^d \times \mathbb{R}$. The goal of the learner is to find a predictor with low risk, defined as the
expected loss

$$L_D(w) \stackrel{\text{def}}{=} \mathbb{E}_{(x,y)\sim D}\big[\ell(\langle w,x \rangle, y)\big].$$





2. Ben-David and Dichterman (1998) do describe learnability results for similar classes but only under the restricted 
family of product distributions. 
Since the distribution $D$ is unknown, the learner relies on a training set of $m$ examples
$S = \{(x_1,y_1),\dots,(x_m,y_m)\}$, which are assumed to be sampled i.i.d. from $D$. We denote the training
loss by

$$L_S(w) \stackrel{\text{def}}{=} \frac{1}{m} \sum_{i=1}^{m} \big(\langle w,x_i \rangle - y_i\big)^2.$$
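
As a small illustration of the training loss just defined, the following Python sketch evaluates the average squared loss of a linear predictor on a sample. The names `inner`, `training_loss`, and `toy_sample` are hypothetical and introduced only for this example; the sample values are made up.

```python
from typing import Sequence, Tuple

Example = Tuple[Sequence[float], float]


def inner(w: Sequence[float], x: Sequence[float]) -> float:
    """Inner product <w, x>."""
    return sum(wi * xi for wi, xi in zip(w, x))


def training_loss(w: Sequence[float], S: Sequence[Example]) -> float:
    """Average squared loss of the linear predictor w on the sample S."""
    return sum((inner(w, x) - y) ** 2 for x, y in S) / len(S)


# Hypothetical toy sample with two attributes per example.
toy_sample = [([1.0, -1.0], 0.5), ([0.5, 0.5], 0.0)]
print(training_loss([0.5, 0.0], toy_sample))
```

The risk is the same quantity with the sample average replaced by an expectation over the unknown distribution, which is why it can only be estimated from data.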


4. Impossibility Results 


Our first result states that any budgeted learning algorithm (local or global) needs in general a budget
of $\Omega(d)$ attributes for learning a d-dimensional linear predictor.


Theorem 1 For any $d \ge 4$ and $\varepsilon \in (0, \tfrac{1}{16})$, there exists a distribution $D$ over $\{-1,+1\}^d \times \{-1,+1\}$
and a weight vector $w^* \in \mathbb{R}^d$, with $\|w^*\|_0 = 1$ and $\|w^*\|_2 = \|w^*\|_1 = 2\sqrt{\varepsilon}$, such that any learning
algorithm must see at least

$$\frac{1}{2} \left\lfloor \frac{d}{96\,\varepsilon} \right\rfloor$$

attributes in order to learn a linear predictor $w$ such that $L_D(w) - L_D(w^*) < \varepsilon$.


The proof is given in the Appendix. In Section 6 we prove that under the same assumptions as those
of Theorem 1, it is possible to learn a predictor using a local budget of two attributes per example
and using a total of $\tilde{O}(d^2)$ training examples. Thus, ignoring logarithmic factors hidden in the $\tilde{O}$
notation, we have a multiplicative gap of $d$ between the lower bound and the upper bound.

Next, we consider the prediction on a budget setting. Greiner et al. (2002) studied this setting 
and showed positive results regarding (agnostic) PAC-learning of k-active predictors. A k-active 
predictor is restricted to use at most k attributes per test example x, where the choice of the i-th 
attribute of x may depend on the values of the i — 1 attributes of x that have been already observed. 
Greiner et al. (2002) show that it is possible to learn a k-active predictor from training examples 
whose performance is slightly worse than that of the best k-active predictor. But how good are the
predictions of the best k-active predictor? We now show that even in simple cases in which there
exists a linear predictor $w^*$ with $L_D(w^*) = 0$, the risk of the best k-active predictor can be high.
The following theorem indeed shows that if the only constraint on $w^*$ is a bounded $\ell_2$ norm, then the
risk can be as high as $1 - \frac{k}{d}$. We use the notation $L_D(A)$ to denote the expected loss of the k-active
predictor $A$ on a test example.


Theorem 2 There exists a weight vector $w^* \in \mathbb{R}^d$ and a distribution $D$ such that $\|w^*\|_2 = 1$ and
$L_D(w^*) = 0$, while any k-active predictor $A$ must have $L_D(A) \ge 1 - \frac{k}{d}$.


Note that the risk of the constant prediction of zero is 1. Therefore, the theorem tells us that no
active predictor can get an improvement over the naive predictor of more than $\frac{k}{d}$.

Proof For any $d > k$ let $w^* = (1/\sqrt{d},\dots,1/\sqrt{d})$. Let $x \in \{\pm 1\}^d$ be distributed uniformly at random
and let $y$ be determined deterministically to be $\langle w^*,x \rangle$. Then $L_D(w^*) = 0$ and $\|w^*\|_2 = 1$. Without loss
of generality, suppose the k-active predictor asks for the first $k$ attributes of a test example and forms
its prediction to be $\hat{y}$. Since the generation of attributes is independent, the value of
$x_{k+1},\dots,x_d$ depends neither on $x_1,\dots,x_k$ nor on $\hat{y}$. Using this and the fact that $\mathbb{E}[x_j] = 0$ for
all $j$ we therefore obtain

$$\mathbb{E}\big[(\hat{y} - \langle w^*,x \rangle)^2\big] = \mathbb{E}\Big[\big(\hat{y} - \sum_{i \le k} w_i^* x_i\big)^2\Big] + \sum_{i > k} (w_i^*)^2\, \mathbb{E}[x_i^2] \;\ge\; \frac{d-k}{d} = 1 - \frac{k}{d},$$

which concludes our proof. □
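
The construction in the proof of Theorem 2 can be checked numerically for small dimensions. The sketch below is an illustration, not part of the original analysis: it enumerates all sign vectors in dimension d and computes the exact risk of the predictor that observes the first k attributes and outputs the partial sum of the corresponding terms, which attains the value 1 - k/d. The function name `first_k_predictor_risk` is hypothetical.

```python
from itertools import product
from math import sqrt


def first_k_predictor_risk(d: int, k: int) -> float:
    """Exact squared-loss risk of predicting sum_{i<k} w*_i x_i when
    w* = (1/sqrt(d), ..., 1/sqrt(d)) and x is uniform on {-1, +1}^d."""
    w_star = [1.0 / sqrt(d)] * d
    total = 0.0
    for x in product((-1.0, 1.0), repeat=d):
        y = sum(wi * xi for wi, xi in zip(w_star, x))      # noiseless target
        y_hat = sum(w_star[i] * x[i] for i in range(k))    # uses k attributes only
        total += (y_hat - y) ** 2
    return total / 2 ** d


for d, k in [(6, 2), (8, 4)]:
    print(d, k, first_k_predictor_risk(d, k), 1 - k / d)
```

The enumeration is exponential in d, so it is only meant as a sanity check for small values.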


It is well known that a low 1-norm of $w^*$ encourages sparsity of the learned predictor, which
naturally helps in designing active predictors. The following theorem shows that even if we restrict
$w^*$ to have $\|w^*\|_1 = 1$, $L_D(w^*) = 0$, and $\|w^*\|_0 > k$, we still have that the risk of the best k-active
predictor can be non-vanishing.


Theorem 3 There exists a weight vector $w^* \in \mathbb{R}^d$ and a distribution $D$ such that $\|w^*\|_1 = 1$,
$L_D(w^*) = 0$, and $\|w^*\|_0 = ck$ (for $c > 1$) such that any k-active predictor $A$ must have

$$L_D(A) \ge \Big(1 - \frac{1}{c}\Big) \frac{1}{ck}.$$


For example, if in the theorem above we choose $c = 2$, then $\|w^*\|_0 = 2k$ and $L_D(A) \ge \frac{1}{4k}$. If we
choose instead $c = \frac{k+1}{k}$, then $\|w^*\|_0 = k+1$ and $L_D(A) \ge \frac{1}{(k+1)^2}$. Note that if $\|w^*\|_0 \le k$ there is a
trivial way to predict on a budget of k attributes by always querying the attributes corresponding to
the non-zero elements of $w^*$.
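
The two special cases above are simple arithmetic on the lower bound. As a hedged illustration, the following sketch assumes the bound of Theorem 3 has the form (1 - 1/c) * 1/(ck), which is what the construction in its proof yields (ck - k unobserved coordinates, each with squared weight 1/(ck)^2), and evaluates it exactly with rational numbers. The function names are hypothetical.

```python
from fractions import Fraction


def theorem3_lower_bound(c: Fraction, k: int) -> Fraction:
    """Assumed form of the bound: (1 - 1/c) * 1/(c*k)."""
    return (1 - 1 / c) / (c * k)


def unobserved_mass(c: Fraction, k: int) -> Fraction:
    """Sum of squared weights over the c*k - k coordinates the predictor cannot see."""
    ck = c * k
    return (ck - k) * (1 / ck) ** 2


k = 5
print(theorem3_lower_bound(Fraction(2), k))         # case c = 2
print(theorem3_lower_bound(Fraction(k + 1, k), k))  # case c = (k+1)/k
```

Both helper functions return the same value, which makes the link between the bound and the proof's construction explicit.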


Proof Let

$$w^* = \big(\underbrace{\tfrac{1}{ck},\dots,\tfrac{1}{ck}}_{ck \text{ components}},0,\dots,0\big)$$

and, similarly to the proof of Theorem 2, let $x \in \{\pm 1\}^d$ be distributed uniformly at random and let
$y$ be determined deterministically to be $\langle w^*,x \rangle$. Then $L_D(w^*) = 0$, $\|w^*\|_1 = 1$, and $\|w^*\|_0 = ck$.
Without loss of generality, suppose the k-active predictor asks for the first $k < ck$ attributes of a
test example and forms its prediction to be $\hat{y}$. Again similarly to the proof of Theorem 2, since the
generation of attributes is independent, the value of $x_{k+1},\dots,x_d$ depends neither on
$x_1,\dots,x_k$ nor on $\hat{y}$. Therefore,

$$\mathbb{E}\big[(\hat{y} - \langle w^*,x \rangle)^2\big] = \mathbb{E}\Big[\big(\hat{y} - \sum_{i=1}^{k} w_i^* x_i\big)^2\Big] + \sum_{i > k} (w_i^*)^2\, \mathbb{E}[x_i^2] \;\ge\; \frac{ck - k}{(ck)^2} = \Big(1 - \frac{1}{c}\Big) \frac{1}{ck},$$

which concludes our proof. □


These negative results highlight an interesting phenomenon: in Section 6 we show that one can 
learn an arbitrarily accurate predictor w with a local budget of k = 2. However, here we show that 
even if we know the optimal w*, we might not be able to accurately predict a new partially observed 
example unless k is very large. Therefore, at least in the worst-case sense, learning on a budget is 
much easier than predicting on a budget. 


5. Local Budget Constraint: A Baseline Algorithm 


In this section we describe a straightforward adaptation of Lasso (Tibshirani, 1996) to the local 
budget setting. This adaptation is based on a direct nonadaptive estimate of the loss function. In 
Section 6 we describe a more effective approach, which combines a stochastic gradient descent 
algorithm called Pegasos (Shalev-Shwartz et al., 2007) with the adaptive sampling of attributes to 
estimate the gradient of the loss at each step. 

A popular approach for learning a linear regressor is to minimize the empirical loss on the
training set plus a regularization term, which often takes the form of a norm of the predictor $w$. For
example, in ridge regression the regularization term is $\|w\|_2^2$ and in Lasso the regularization term
is $\|w\|_1$. Instead of regularization, we can include a constraint of the form $\|w\|_1 \le B$ or $\|w\|_2 \le B$.
Modulo an appropriate choice of the parameters, the regularization form is equivalent to the
constraint form. In the constraint form, the predictor is a solution to the following optimization
problem

$$\min_{w \in \mathbb{R}^d} \; \frac{1}{|S|} \sum_{(x,y) \in S} \big(\langle w,x \rangle - y\big)^2 \quad \text{s.t.} \quad \|w\|_p \le B \qquad (1)$$

where $S = \{(x_1,y_1),\dots,(x_m,y_m)\}$ is a training set of $m$ examples, $B$ is the regularization parameter,
and $p$ is 1 for Lasso and 2 for ridge regression.
We start with a standard risk bound for constrained predictors. 


Lemma 4 Let $D$ be a distribution on pairs $(x,y) \in \mathbb{R}^d \times \mathbb{R}$ such that $\|x\|_\infty \le 1$ and $|y| \le B$ holds
with probability one. Then there exists a constant $c > 0$ such that

$$\max_{w : \|w\|_1 \le B} |L_S(w) - L_D(w)| \le c\,B^2 \sqrt{\frac{1}{m} \ln \frac{d}{\delta}}$$

holds with probability at least $1 - \delta$ with respect to the random draw of the training set $S$ of size $m$
from $D$.




Proof We apply the following Rademacher bound (Kakade et al., 2008)

$$|L_S(w) - L_D(w)| \le L_{\max} B \sqrt{\frac{2}{m} \ln 2d} + \ell_{\max} \sqrt{\frac{1}{2m} \ln \frac{2}{\delta}}$$

that holds with probability at least $1 - \delta$ for all $w \in \mathbb{R}^d$ such that $\|w\|_1 \le B$, where $L_{\max}$ bounds
the Lipschitz constant for the square loss from above, and $\ell_{\max}$ bounds the square loss from above.
The result then follows by observing that $|(a-y)^2 - (b-y)^2| \le |a-b|\,|a+b-2y|$. Hence, $L_{\max} \le
\max_{a,b,y} |a+b-2y| = 4B$ where both $a$ and $b$ are of the form $\langle w,x \rangle$, and we used the fact $|\langle w,x \rangle| \le B$
(recall that $\|x\|_\infty \le 1$) together with the assumption $|y| \le B$. Similarly, under the same assumptions,
$\ell_{\max} = \max_{a,y} (a-y)^2 = 4B^2$. □


This immediately leads to the following risk bound for Lasso. 


Corollary 5 If $\hat{w}$ is a minimizer of (1) with $p = 1$, then there exists a constant $c > 0$ such that, under
the same assumptions as Lemma 4,

$$L_D(\hat{w}) \le \min_{w : \|w\|_1 \le B} L_D(w) + c\,B^2 \sqrt{\frac{1}{m} \ln \frac{d}{\delta}} \qquad (2)$$

holds with probability at least $1 - \delta$ over the random draw of the training set $S$ of size $m$ from $D$.

To adapt Lasso to the partial information case, we first rewrite the squared loss as follows:

$$\big(\langle w,x \rangle - y\big)^2 = w^\top (x x^\top) w - 2 y\, x^\top w + y^2,$$

where $w, x$ are column vectors and $w^\top, x^\top$ are their corresponding transposes (i.e., row vectors). Next,
we estimate the matrix $x x^\top$ and the vector $x$ using the partial information we have, and then we solve
the optimization problem given in (1) with the estimated values of $x x^\top$ and $x$. To estimate the vector
$x$ we can pick an index $i$ uniformly at random from $[d] = \{1,\dots,d\}$ and define the estimation to be
a vector $v$ such that

$$v_r = \begin{cases} d\,x_r & \text{if } r = i \\ 0 & \text{else.} \end{cases} \qquad (3)$$














It is easy to verify that $v$ is an unbiased estimate of $x$, namely, $\mathbb{E}[v] = x$ where expectation is with
respect to the choice of the index $i$. To estimate the matrix $x x^\top$ we could pick two indices $i, j$
independently and uniformly at random from $[d]$, and define the estimation to be a matrix with all
zeros except $d^2 x_i x_j$ in the $(i,j)$ entry. However, this yields a non-symmetric matrix which will
make our optimization problem with the estimated matrix non-convex. To overcome this obstacle,
we symmetrize the matrix by adding its transpose and dividing by 2. This sampling process can be
easily generalized to the case where $k > 1$ attributes can be seen. The resulting baseline procedure³
is given in Algorithm 1.
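
The unbiasedness of these single-attribute and attribute-pair estimates can be verified by exact enumeration. The sketch below is an illustration under the sampling scheme just described (one uniform index for the vector, two independent uniform indices for the matrix); all function names are hypothetical, and rational arithmetic is used so that the equalities hold exactly.

```python
from fractions import Fraction
from typing import List


def single_index_estimate(x: List[Fraction], i: int) -> List[Fraction]:
    """Vector with d * x_i in coordinate i and zeros elsewhere."""
    d = len(x)
    return [d * x[r] if r == i else Fraction(0) for r in range(d)]


def symmetrized_pair_estimate(x: List[Fraction], i: int, j: int) -> List[List[Fraction]]:
    """Matrix with d^2 x_i x_j at (i, j), averaged with its transpose."""
    d = len(x)
    est = [[Fraction(0)] * d for _ in range(d)]
    half = Fraction(d * d) * x[i] * x[j] / 2
    est[i][j] += half
    est[j][i] += half
    return est


def expected_vector(x: List[Fraction]) -> List[Fraction]:
    """Exact expectation over a uniformly random index i."""
    d = len(x)
    return [sum(single_index_estimate(x, i)[r] for i in range(d)) / d for r in range(d)]


def expected_matrix(x: List[Fraction]) -> List[List[Fraction]]:
    """Exact expectation over independent uniform indices i and j."""
    d = len(x)
    total = [[Fraction(0)] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            est = symmetrized_pair_estimate(x, i, j)
            for r in range(d):
                for s in range(d):
                    total[r][s] += est[r][s]
    return [[entry / (d * d) for entry in row] for row in total]


x = [Fraction(1, 2), Fraction(-1), Fraction(1, 4)]
print(expected_vector(x) == x)
print(expected_matrix(x) == [[a * b for b in x] for a in x])
```

Each individual estimate is very sparse and has entries as large as d or d^2 in magnitude, which is the source of the extra variance that later shows up in the sample complexity.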

The following theorem shows that similar to Lasso, the Baseline algorithm is competitive with 
the optimal linear predictor with a bounded 1-norm. 





3. We note that an even simpler approach is to arbitrarily assume that the correlation matrix is the identity matrix and
then the solution to the loss minimization problem is simply the averaged vector, $w = \sum_{(x,y) \in S} y\,x$. In that case, we
can simply replace $x$ by its estimated vector as defined in (3). While this naive approach can work on very simple
classification tasks, it will perform poorly on realistic data sets, in which the correlation matrix is not likely to be
the identity. Indeed, in our experiments with the MNIST data set, we found that this approach performed poorly
relative to the algorithms proposed in this paper.
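ALGORITHM: Baseline(S,k)
INPUT: Training set $S$ of size $m$, local budget $k \ge 2$ (with $k$ even)
INITIALIZE: $\bar{A} = 0 \in \mathbb{R}^{d \times d}$ ; $\bar{v} = 0 \in \mathbb{R}^d$ ; $\bar{y} = 0$

for each $(x,y) \in S$
    $v = 0 \in \mathbb{R}^d$ ; $A = 0 \in \mathbb{R}^{d \times d}$
    Choose a set $C$ of $k$ entries from $[d]$, uniformly without replacement
    for each $c \in C$
        $v_c = v_c + \frac{d}{k} x_c$
    Randomly split $C$ into two sets $I, J$ of size $k/2$ each
    for each $(i,j) \in I \times J$
        $A_{i,j} = A_{i,j} + 2 \big(\frac{d}{k}\big)^2 x_i x_j$ ; $A_{j,i} = A_{j,i} + 2 \big(\frac{d}{k}\big)^2 x_i x_j$
    end
    $\bar{A} = \bar{A} + \frac{A}{m}$ ; $\bar{v} = \bar{v} + 2y \frac{v}{m}$ ; $\bar{y} = \bar{y} + \frac{y^2}{m}$
end

Let $\tilde{L}_S(w) = w^\top \bar{A} w + w^\top \bar{v} + \bar{y}$

OUTPUT: $\hat{w} = \operatorname{argmin}_{w : \|w\|_1 \le B} \tilde{L}_S(w)$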











Figure 1: An adaptation of Lasso to the local budget setting, where the learner can view at most 
k attributes of each training example. The predictive performance of this algorithm is 
analyzed in Theorem 6. 
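
The statistics-gathering loop of Figure 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `baseline_statistics` and `estimated_loss` are hypothetical, the constrained minimization step is omitted, and the linear term is subtracted when evaluating the loss so that the estimate follows the expansion of the squared loss into a quadratic term, minus a linear term, plus a constant.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], float]


def baseline_statistics(S: Sequence[Example], k: int, rng: random.Random):
    """Accumulate the matrix, vector and scalar statistics of Figure 1,
    reading only k attributes of each example (k even)."""
    m, d = len(S), len(S[0][0])
    scale = d / k
    A_bar = [[0.0] * d for _ in range(d)]
    v_bar = [0.0] * d
    y_bar = 0.0
    for x, y in S:
        # k distinct attribute indices; only x[c] for c in C is ever read.
        C = rng.sample(range(d), k)
        for c in C:
            v_bar[c] += 2.0 * y * scale * x[c] / m
        # rng.sample returns indices in random order, so slicing is a random split.
        I, J = C[: k // 2], C[k // 2:]
        for i in I:
            for j in J:
                inc = 2.0 * scale ** 2 * x[i] * x[j] / m
                A_bar[i][j] += inc
                A_bar[j][i] += inc
        y_bar += y * y / m
    return A_bar, v_bar, y_bar


def estimated_loss(w: Sequence[float], A_bar, v_bar, y_bar: float) -> float:
    """Quadratic surrogate built from the accumulated statistics
    (sign of the linear term is an assumption of this sketch)."""
    d = len(w)
    quad = sum(w[i] * A_bar[i][j] * w[j] for i in range(d) for j in range(d))
    lin = sum(wi * vi for wi, vi in zip(w, v_bar))
    return quad - lin + y_bar
```

A solver for the 1-norm constrained quadratic problem would then be applied to `estimated_loss`; any such solver is outside the scope of this sketch.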


Theorem 6 Let $D$ be a distribution on pairs $(x,y) \in \mathbb{R}^d \times \mathbb{R}$ such that $\|x\|_\infty \le 1$ and $|y| \le B$ with
probability one. Let $\hat{w}$ be the output of Baseline(S,k), where $|S| = m$. Then there exists a constant
$c > 0$ such that

$$L_D(\hat{w}) \le \min_{w : \|w\|_1 \le B} L_D(w) + c \Big(\frac{dB}{k}\Big)^2 \sqrt{\frac{1}{m} \ln \frac{d}{\delta}}$$

holds with probability of at least $1 - \delta$ over the random draw of the training set $S$ from $D$ and the
algorithm’s own randomization.


The above theorem tells us that for a sufficiently large training set we can find a very good predictor.
Put another way, a large number of examples can compensate for the lack of full information on each
individual example. In particular, to overcome the extra factor $(d/k)^2$ in the bound, which does not
appear in the full information bound given in (2), we need to increase $m$ by a factor of $(d/k)^4$. In
the next section, we describe a better, adaptive procedure for the partial information case.

In view of proving Theorem 6, we first show that sampling $k$ elements without replacement
and then averaging the result has the same expectation as sampling just once.
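Lemma 7 Let $C$ be a set of $n$ elements and let $f : C \to \mathbb{R}$ be an arbitrary function. Let $C_k = \{C' \subseteq C
: |C'| = k\}$ and let $U$ be the uniform distribution over $C_k$. Then

$$\mathbb{E}_{C' \sim U}\Big[\frac{1}{k} \sum_{c \in C'} f(c)\Big] = \frac{1}{n} \sum_{c \in C} f(c).$$

Proof We have

$$\mathbb{E}_{C' \sim U}\Big[\frac{1}{k} \sum_{c \in C'} f(c)\Big] = \frac{1}{\binom{n}{k}} \sum_{C' \in C_k} \frac{1}{k} \sum_{c \in C'} f(c) = \frac{1}{k \binom{n}{k}} \sum_{c \in C} f(c)\, \big|\{C' \in C_k : c \in C'\}\big| = \frac{\binom{n-1}{k-1}}{k \binom{n}{k}} \sum_{c \in C} f(c) = \frac{1}{n} \sum_{c \in C} f(c),$$

and this concludes the proof. □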
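
Lemma 7 is easy to confirm by brute force on a small set. The sketch below is an illustration only: it averages the within-subset mean over every k-subset using exact rational arithmetic and compares the result with the overall mean. The function name `mean_over_subsets` and the sample values are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from typing import Sequence


def mean_over_subsets(values: Sequence[int], k: int) -> Fraction:
    """Average, over all k-subsets of the index set, of the subset's mean value."""
    subsets = list(combinations(range(len(values)), k))
    total = sum(Fraction(sum(values[c] for c in sub)) / k for sub in subsets)
    return total / len(subsets)


values = [3, -1, 4, 1, -5]
overall_mean = Fraction(sum(values), len(values))
print([mean_over_subsets(values, k) == overall_mean for k in range(1, 6)])
```

The check works for every subset size because each element belongs to the same number of k-subsets, which is the counting step used in the proof.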


We now show that the estimation matrix constructed by the Baseline algorithm is likely to be close 
to the true correlation matrix over the training set. 


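Lemma 8 Let $A_t$ be the matrix constructed at iteration $t$ of the Baseline algorithm and note that
$\bar{A} = \frac{1}{m} \sum_{t=1}^{m} A_t$. Let $\bar{X} = \frac{1}{m} \sum_{t=1}^{m} x_t x_t^\top$. Then, with probability of at least $1 - \delta$ over the algorithm’s
own randomness we have that

$$|\bar{A}_{r,s} - \bar{X}_{r,s}| \le \Big(\frac{d}{k}\Big)^2 \sqrt{\frac{8}{m} \ln\Big(\frac{2d^2}{\delta}\Big)} \qquad \forall\, r,s = 1,\dots,d.$$

Proof Based on Lemma 7, it is easy to verify that $\mathbb{E}[A_t] = x_t x_t^\top$. Additionally, since we sample
without replacement, each element of $A_t$ is in $\big[-2(\frac{d}{k})^2,\, 2(\frac{d}{k})^2\big]$ because we assume $\|x_t\|_\infty \le 1$.
Therefore, we can apply Hoeffding’s inequality on each element of $\bar{A}$ and obtain that

$$\mathbb{P}\big[|\bar{A}_{r,s} - \bar{X}_{r,s}| > \epsilon\big] \le 2 \exp\Big(-\frac{m \epsilon^2}{8} \Big(\frac{k}{d}\Big)^4\Big).$$

Combining the above with the union bound we obtain that

$$\mathbb{P}\big[\exists (r,s) : |\bar{A}_{r,s} - \bar{X}_{r,s}| > \epsilon\big] \le 2 d^2 \exp\Big(-\frac{m \epsilon^2}{8} \Big(\frac{k}{d}\Big)^4\Big).$$

Setting the right-hand side of the above to $\delta$ and rearranging terms concludes the proof. □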
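
The last step of the proof of Lemma 8, solving the union bound for the deviation, can be checked numerically. The sketch below assumes the two expressions read as reconstructed here: a failure probability of 2 d^2 exp(-(m eps^2 / 8)(k/d)^4) and a deviation of (d/k)^2 sqrt((8/m) ln(2 d^2 / delta)). The function names and the numeric values are hypothetical.

```python
from math import exp, log, sqrt


def lemma8_radius(d: int, k: int, m: int, delta: float) -> float:
    """Deviation eps at which the union bound equals delta (assumed form)."""
    return (d / k) ** 2 * sqrt(8.0 / m * log(2.0 * d * d / delta))


def union_bound_failure(d: int, k: int, m: int, eps: float) -> float:
    """Union bound over all d^2 matrix entries (assumed form)."""
    return 2.0 * d * d * exp(-(m * eps ** 2 / 8.0) * (k / d) ** 4)


d, k, m, delta = 10, 2, 5000, 0.05
eps = lemma8_radius(d, k, m, delta)
print(eps, union_bound_failure(d, k, m, eps))
```

Plugging the radius back into the union bound returns the chosen confidence level, and the radius scales as (d/k)^2 divided by the square root of m.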





Next, we show that the estimate of the linear part of the objective function is also likely to be 
accurate. 
Lemma 9 Let $v_t$ be the vector constructed at iteration $t$ of the Baseline algorithm and note that
$\bar{v} = \frac{1}{m} \sum_{t=1}^{m} 2 y_t v_t$. Let $\bar{x} = \frac{1}{m} \sum_{t=1}^{m} 2 y_t x_t$. Then, with probability at least $1 - \delta$ over the algorithm’s
own randomness we have that

$$\|\bar{v} - \bar{x}\|_\infty \le \frac{dB}{k} \sqrt{\frac{8}{m} \ln\Big(\frac{2d}{\delta}\Big)}.$$

Proof Based on Lemma 7, it is easy to verify that $\mathbb{E}[2 y_t v_t] = 2 y_t x_t$. Additionally, since we sample
$k$ elements without replacement, each element of $v_t$ is in $[-\frac{d}{k}, \frac{d}{k}]$ (because we assume $\|x_t\|_\infty \le 1$)
and thus each element of $2 y_t v_t$ is in $[-\frac{2dB}{k}, \frac{2dB}{k}]$ (because we assume that $|y_t| \le B$). Therefore, we
can apply Hoeffding’s inequality on each element of $\bar{v}$ and obtain that

$$\mathbb{P}\big[|\bar{v}_r - \bar{x}_r| > \epsilon\big] \le 2 \exp\Big(-\frac{m \epsilon^2}{8} \Big(\frac{k}{dB}\Big)^2\Big).$$

Combining the above with the union bound we obtain that

$$\mathbb{P}\big[\exists r : |\bar{v}_r - \bar{x}_r| > \epsilon\big] \le 2 d \exp\Big(-\frac{m \epsilon^2}{8} \Big(\frac{k}{dB}\Big)^2\Big).$$

Setting the right-hand side of the above to $\delta$ and rearranging terms concludes the proof. □







Next, we show that the estimated training loss

$$\tilde{L}_S(w) = w^\top \bar{A} w + w^\top \bar{v} + \bar{y}$$

computed by the Baseline algorithm is close to the true training loss.


Lemma 10 With probability greater than $1 - \delta$ over the Baseline algorithm’s own randomization,
for all $w$ such that $\|w\|_1 \le B$,

$$|\tilde{L}_S(w) - L_S(w)| \le \Big(\frac{Bd}{k}\Big)^2 \sqrt{\frac{32}{m} \ln\Big(\frac{4d^2}{\delta}\Big)}.$$





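Proof Using Hölder’s inequality twice and Lemma 8 we get

$$|w^\top (\bar{A} - \bar{X}) w| \le \Big(\frac{Bd}{k}\Big)^2 \sqrt{\frac{8}{m} \ln\Big(\frac{2d^2}{\delta}\Big)}. \qquad (4)$$

Similarly, using Hölder’s inequality and Lemma 9 we also get

$$|w^\top (\bar{v} - \bar{x})| \le \frac{B^2 d}{k} \sqrt{\frac{8}{m} \ln\Big(\frac{2d}{\delta}\Big)}. \qquad (5)$$

Using the triangle inequality, (4) and (5) (each invoked with $\delta/2$ in place of $\delta$), and the union bound we finally obtain

$$|\tilde{L}_S(w) - L_S(w)| = |w^\top \bar{A} w + w^\top \bar{v} + \bar{y} - w^\top \bar{X} w - w^\top \bar{x} - \bar{y}| \le |w^\top (\bar{A} - \bar{X}) w| + |w^\top (\bar{v} - \bar{x})| \le \Big(\frac{Bd}{k}\Big)^2 \sqrt{\frac{8}{m} \ln\Big(\frac{4d^2}{\delta}\Big)} + \frac{B^2 d}{k} \sqrt{\frac{8}{m} \ln\Big(\frac{4d}{\delta}\Big)},$$

which upon slight simplifications concludes the proof. □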


We are now ready to prove Theorem 6. 


Proof (of Theorem 6) Lemma 4 states that with probability greater than $1 - \delta$ over the random draw
of a training set $S$ of $m$ examples, for all $w$ such that $\|w\|_1 \le B$, we have that

$$|L_S(w) - L_D(w)| \le c' B^2 \sqrt{\frac{1}{m} \ln \frac{d}{\delta}}$$

for some $c' > 0$. Combining the above with Lemma 10, we obtain that for some $c > 0$, with
probability at least $1 - \delta$ over both the random draw of the training set and the algorithm’s own
randomization,

$$|\tilde{L}_S(w) - L_D(w)| \le |\tilde{L}_S(w) - L_S(w)| + |L_S(w) - L_D(w)| \le c \Big(\frac{dB}{k}\Big)^2 \sqrt{\frac{1}{m} \ln \frac{d}{\delta}}$$

for all $w$ such that $\|w\|_1 \le B$. The proof of Theorem 6 follows since the Baseline algorithm
minimizes $\tilde{L}_S(w)$. □


6. Gradient-Based Attribute Efficient Regression 


In this section, by avoiding the estimation of the matrix $x x^\top$, we significantly decrease the number
of additional examples sufficient for learning with $k$ attributes per training example. To do so, we do
not try to estimate the loss function but rather to estimate the gradient $\nabla \ell(w) = 2(\langle w,x \rangle - y)x$, with
respect to $w$, of the squared loss function $\ell(w) = (\langle w,x \rangle - y)^2$. Each vector $w$ defines a probability
distribution $P$ over $[d]$ by letting $P(i) = |w_i| / \|w\|_1$. We can estimate the gradient using an even
number $k \ge 2$ of attributes as follows. First, we randomly pick a subset $i_1,\dots,i_{k/2}$ from $[d]$ according
to the uniform distribution over the $k/2$-subsets in $[d]$. Based on this, we estimate the vector $x$ via

$$v = \frac{2}{k} \sum_{r=1}^{k/2} d\, x_{i_r} e_{i_r} \qquad (6)$$

where $e_j$ is the j-th element of the canonical basis of $\mathbb{R}^d$. Second, we randomly pick $j_1,\dots,j_{k/2}$
from $[d]$ without replacement according to the distribution defined by $w$. Based on this, we estimate
the term $\langle w,x \rangle$ by

$$\hat{y} = \frac{2}{k} \sum_{r=1}^{k/2} \|w\|_1\, \mathrm{sgn}(w_{j_r})\, x_{j_r}. \qquad (7)$$

This allows us to obtain an unbiased estimate of the gradient, as stated by the following simple
result.




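Lemma 11 Fix any $w, x \in \mathbb{R}^d$ and $y \in \mathbb{R}$ and let $\ell(w) = (\langle w,x \rangle - y)^2$ be the square loss. Then the
estimate

$$\tilde{\nabla} \ell(w) = 2(\hat{y} - y) v \qquad (8)$$

satisfies $\mathbb{E}\, \tilde{\nabla} \ell(w) = 2(\langle w,x \rangle - y)x = \nabla \ell(w)$.

Proof Since $\mathbb{E}[d\, x_j e_j] = x$ for a random $j \in [d]$, Lemma 7 immediately implies that $\mathbb{E}[v] = x$.
Moreover, it is easy to see that $\mathbb{E}\big[\|w\|_1\, \mathrm{sgn}(w_i)\, x_i\big] = \langle w,x \rangle$ when $i$ is drawn with probability
$P(i) = |w_i| / \|w\|_1$. Hence $\mathbb{E}[\hat{y}] = \langle w,x \rangle$. The proof is concluded by noting that $i_1,\dots,i_{k/2}$ are drawn
independently from $j_1,\dots,j_{k/2}$. □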
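
The two expectations used in the proof of Lemma 11 can be computed exactly for a small example. The sketch below is an illustration under stated assumptions: the vector estimate averages over all k/2-subsets as in (6), the scalar estimate uses the fact that every draw from the distribution P has the same expectation, and the product rule for expectations relies on the two groups of indices being independent. All function names are hypothetical, and w is assumed to be non-zero.

```python
from fractions import Fraction
from itertools import combinations
from typing import List


def sign(a: Fraction) -> int:
    return (a > 0) - (a < 0)


def expected_v(x: List[Fraction], k: int) -> List[Fraction]:
    """Exact expectation of the vector estimate over uniform k/2-subsets."""
    d, half = len(x), k // 2
    subsets = list(combinations(range(d), half))
    total = [Fraction(0)] * d
    for subset in subsets:
        for i in subset:
            total[i] += Fraction(2, k) * d * x[i]
    return [entry / len(subsets) for entry in total]


def expected_y_hat(w: List[Fraction], x: List[Fraction]) -> Fraction:
    """Exact expectation of one draw ||w||_1 sgn(w_i) x_i with P(i) = |w_i| / ||w||_1."""
    norm1 = sum(abs(wi) for wi in w)
    return sum((abs(wi) / norm1) * norm1 * sign(wi) * xi for wi, xi in zip(w, x))


def expected_gradient_estimate(w, x, y, k) -> List[Fraction]:
    """E[2 (y_hat - y) v] = 2 (E[y_hat] - y) E[v] by independence."""
    scale = 2 * (expected_y_hat(w, x) - y)
    return [scale * vi for vi in expected_v(x, k)]


w = [Fraction(1, 2), Fraction(-1, 4), Fraction(0), Fraction(1, 4)]
x = [Fraction(1), Fraction(-1, 2), Fraction(1, 3), Fraction(-1)]
y = Fraction(1, 2)
true_gradient = [2 * (sum(a * b for a, b in zip(w, x)) - y) * xi for xi in x]
print(expected_gradient_estimate(w, x, y, 2) == true_gradient)
```

Coordinates with zero weight are never sampled for the scalar estimate, which is harmless because they contribute nothing to the inner product.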






























































The advantage of the above approach over the loss based approach we took before is that the
magnitude of each element of the gradient estimate is of order $d\,\|w\|_1$. This is in contrast to what we had
for the loss based approach, where the magnitude of each element of the matrix $A$ was of order $d^2$.
In many situations, the 1-norm of a good predictor is significantly smaller than $d$ and in these cases
the gradient based estimate is better than the loss based estimate. However, while in the previous 
approach our estimation did not depend on a specific w, now the estimation depends on w. We 
therefore need an iterative learning method in which at each iteration we use the gradient of the loss 
function on an individual example. Luckily, the stochastic gradient descent approach conveniently 
fits our needs. 

Concretely, below we describe a variant of the Pegasos algorithm (Shalev-Shwartz et al., 2007)
for learning linear regressors. Pegasos tries to minimize the regularized risk

$$\min_{w} \; \frac{\lambda}{2} \|w\|_2^2 + \mathbb{E}_{(x,y) \sim D}\big[(\langle w,x \rangle - y)^2\big]. \qquad (9)$$

Of course, the distribution $D$ is unknown, and therefore we cannot hope to solve the above problem
exactly. Instead, Pegasos finds a sequence of weight vectors that (on average) converge to the
solution of (9). We start with the all zeros vector $w = 0 \in \mathbb{R}^d$. Then, at each iteration Pegasos picks
the next example in the training set (which is equivalent to sampling a fresh example according to
$D$) and calculates the gradient of the regularized loss

$$g(w) = \frac{\lambda}{2} \|w\|_2^2 + (\langle w,x \rangle - y)^2$$

for this example with respect to the current weight vector $w$. This gradient is simply $\nabla g(w) =
\lambda w + \nabla \ell(w)$, where $\nabla \ell(w) = 2(\langle w,x \rangle - y)x$. Finally, Pegasos updates the predictor according to
the gradient descent rule $w \leftarrow w - \frac{1}{\lambda t} \nabla g(w)$ where $t$ is the current iteration number. This can be
rewritten as $w \leftarrow (1 - \frac{1}{t}) w - \frac{1}{\lambda t} \nabla \ell(w)$.
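
The equivalence of the two forms of the Pegasos update can be made concrete with a short full-information sketch. This is an illustration only: the function names `pegasos_step` and `pegasos_step_explicit` are hypothetical, and the code assumes the step size 1/(lambda t) and the squared loss described above.

```python
from typing import List, Sequence


def pegasos_step(w: Sequence[float], x: Sequence[float], y: float,
                 lam: float, t: int) -> List[float]:
    """Rewritten update: (1 - 1/t) w - (1/(lam t)) * grad of the squared loss."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    grad_loss = [2.0 * (pred - y) * xi for xi in x]
    eta = 1.0 / (lam * t)
    return [(1.0 - 1.0 / t) * wi - eta * gi for wi, gi in zip(w, grad_loss)]


def pegasos_step_explicit(w: Sequence[float], x: Sequence[float], y: float,
                          lam: float, t: int) -> List[float]:
    """Same step written as w - (1/(lam t)) * (lam w + grad of the squared loss)."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    eta = 1.0 / (lam * t)
    return [wi - eta * (lam * wi + 2.0 * (pred - y) * xi) for wi, xi in zip(w, x)]


w_new = pegasos_step([0.0, 0.0], [1.0, -1.0], 0.5, lam=2.0, t=1)
print(w_new)
```

At the first iteration the shrinkage factor is zero, so the new weight vector depends only on the gradient of the loss at the starting point.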

To apply Pegasos in the partial information case we could simply replace the gradient vector
$\nabla \ell(w)$ with its estimation given in (8). However, our analysis shows that it is desirable to maintain
an estimation vector $\tilde{\nabla} \ell(w)$ with small magnitude. Since the magnitude of $\tilde{\nabla} \ell(w) = 2(\hat{y} - y)v$ is
of order $d\,\|w\|_1$, we would like to ensure that $\|w\|_1$ is always smaller than some threshold $B$. We
achieve this goal by adding an additional projection step at the end of each Pegasos iteration.
Formally, the update is performed in two steps as follows

$$w \leftarrow \Big(1 - \frac{1}{t}\Big) w - \frac{1}{\lambda t}\, 2(\hat{y} - y) v \qquad (10)$$

$$w \leftarrow \operatorname{argmin}_{u : \|u\|_1 \le B} \|u - w\|_2 \qquad (11)$$
ALGORITHM: AER(S,k)
INPUT: Training set $S$ of size $m$, local budget $k \ge 2$ (with $k$ even)
PARAMETER: $\lambda > 0$
INITIALIZATION: $w = 0 \in \mathbb{R}^d$ ; $\bar{w} = w$ ; $t = 1$
for each $(x,y) \in S$
    $v = 0 \in \mathbb{R}^d$ ; $\hat{y} = 0$
    Choose $C$ uniformly at random from all subsets of $[d]$ of size $k/2$
    for each $j \in C$
        $v_j = v_j + \frac{2}{k} d\, x_j$
    end
    for $r = 1,\dots,k/2$
        sample $i$ from $[d]$ based on $P(i) = \frac{|w_i|}{\|w\|_1}$ (if $w = 0$ set $P(i) = 1/d$)
        $\hat{y} = \hat{y} + \frac{2}{k}\, \mathrm{sgn}(w_i)\, \|w\|_1\, x_i$
    end
    $w = (1 - \frac{1}{t}) w - \frac{2}{\lambda t} (\hat{y} - y) v$
    $w = \operatorname{argmin}_{u : \|u\|_1 \le B} \|u - w\|_2$
    $\bar{w} = \bar{w} + \frac{w}{m}$ ; $t = t + 1$
end
OUTPUT: $\bar{w}$











Figure 2: An adaptation of the Pegasos algorithm to the local budget setting. Theorem 12 provides 
a performance guarantee for this algorithm. 
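
The procedure of Figure 2 can be sketched compactly in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `project_l1_ball` and `aer` are hypothetical, the projection uses a simple sort-based routine that runs in O(d log d) time rather than the linear-time method cited in the text, and the indices used for the scalar estimate are drawn independently from P as in the figure.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], float]


def project_l1_ball(w: Sequence[float], B: float) -> List[float]:
    """Euclidean projection onto {u : ||u||_1 <= B} via sorting and soft-thresholding."""
    if sum(abs(wi) for wi in w) <= B:
        return list(w)
    u = sorted((abs(wi) for wi in w), reverse=True)
    cumsum, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        candidate = (cumsum - B) / j
        if uj - candidate > 0:          # holds exactly for the leading coordinates
            theta = candidate
    return [max(abs(wi) - theta, 0.0) * (1.0 if wi > 0 else -1.0) for wi in w]


def aer(S: Sequence[Example], k: int, lam: float, B: float,
        rng: random.Random) -> List[float]:
    """One pass over S, reading at most k attributes of each example (k even)."""
    m, d, half = len(S), len(S[0][0]), k // 2
    w = [0.0] * d
    w_bar = [0.0] * d
    for t, (x, y) in enumerate(S, start=1):
        # Estimate of x from a uniformly random subset of k/2 attributes.
        v = [0.0] * d
        for j in rng.sample(range(d), half):
            v[j] += (2.0 / k) * d * x[j]
        # Estimate of <w, x> from k/2 indices drawn with P(i) = |w_i| / ||w||_1.
        norm1 = sum(abs(wi) for wi in w)
        weights = [abs(wi) for wi in w] if norm1 > 0 else [1.0] * d
        y_hat = 0.0
        for i in rng.choices(range(d), weights=weights, k=half):
            sgn = (w[i] > 0) - (w[i] < 0)
            y_hat += (2.0 / k) * sgn * norm1 * x[i]
        # Gradient step with the estimated gradient, then projection.
        w = [(1.0 - 1.0 / t) * wi - (2.0 / (lam * t)) * (y_hat - y) * vj
             for wi, vj in zip(w, v)]
        w = project_l1_ball(w, B)
        w_bar = [a + b / m for a, b in zip(w_bar, w)]
    return w_bar
```

Because every iterate lies in the 1-norm ball of radius B, the averaged output does as well, which is the property the analysis relies on when bounding the gradient estimates.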


where $v$ and $\hat{y}$ are respectively defined by (6) and (7). The projection step (11) can be performed
efficiently in time $O(d)$ using the technique described in Duchi et al. (2008). A pseudo-code of the
resulting Attribute Efficient Regression algorithm is given in Figure 2.

Note that the right-hand side of (10) is $w - \frac{1}{\lambda t} \nabla f(w)$ for the function

$$f(w) = \frac{\lambda}{2} \|w\|_2^2 + 2(\hat{y} - y) \langle v,w \rangle. \qquad (12)$$


This observation is used in the proof of the following result, providing convergence guarantees for 
AER. 





Theorem 12 Let $D$ be a distribution on pairs $(x,y) \in \mathbb{R}^d \times \mathbb{R}$ such that $\|x\|_\infty \le 1$ and $|y| \le B$ with
probability one. Let $S$ be a training set of size $m$ and let $\bar{w}$ be the output of AER(S,k) run with
$\lambda = 12d \sqrt{\log(m)/(mk)}$. Then, there exists a constant $c > 0$ such that

$$L_D(\bar{w}) \le \min_{w : \|w\|_1 \le B} L_D(w) + c\, d B^2 \sqrt{\frac{1}{km} \ln \frac{m}{\delta}}$$

holds with probability at least $1 - \delta$ over both the choice of the training set and the algorithm’s own
randomization.


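Proof Let $y_t, \hat{y}_t, v_t, w_t$ be the values of $y, \hat{y}, v, w$, respectively, at each iteration $t$ of the AER algorithm.
Moreover, let $\nabla_t = 2(\langle w_t,x_t \rangle - y_t)x_t$ and $\tilde{\nabla}_t = 2(\hat{y}_t - y_t)v_t$. From the convexity of the squared loss,
and taking expectation with respect to the algorithm’s own randomization, we have that for any
vector $w^*$ such that $\|w^*\|_1 \le B$,

$$\mathbb{E}\Big[\sum_{t=1}^{m} (\langle w_t,x_t \rangle - y_t)^2 - \sum_{t=1}^{m} (\langle w^*,x_t \rangle - y_t)^2\Big] \le \mathbb{E}\Big[\sum_{t=1}^{m} \langle \nabla_t, w_t - w^* \rangle\Big] = \mathbb{E}\Big[\sum_{t=1}^{m} \langle \tilde{\nabla}_t, w_t - w^* \rangle\Big] = \mathbb{E}\Big[\sum_{t=1}^{m} 2(\hat{y}_t - y_t) \langle v_t, w_t - w^* \rangle\Big].$$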














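For the first equality we used Lemma 11, which states that, conditioned on $w_t$, $E[\tilde{\nabla}_t] = \nabla_t$.
We now deterministically bound the random quantity inside the above expectation as follows:

\[ \sum_{t=1}^m 2(\hat{y}_t - y_t)\langle v_t, w_t - w^\star\rangle
   = \sum_{t=1}^m \bigl(f_t(w_t) - f_t(w^\star)\bigr) + \frac{\lambda}{2}\sum_{t=1}^m \bigl(\|w^\star\|_2^2 - \|w_t\|_2^2\bigr)
   \le \sum_{t=1}^m f_t(w_t) - \sum_{t=1}^m f_t(w^\star) + \frac{m\lambda}{2}\|w^\star\|_2^2 \]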


where $f_t(w) = \frac{\lambda}{2}\|w\|_2^2 + 2(\hat{y}_t - y_t)\langle v_t, w\rangle$ is the λ-strongly convex function defined in (12). Recalling
that the right-hand side in the AER update (10) is equal to $w_t - \frac{1}{\lambda t}\nabla f_t(w_t)$, we can apply the following
logarithmic regret bound for λ-strongly convex functions (Hazan et al., 2006; Kakade and
Shalev-Shwartz, 2008)

\[ \sum_{t=1}^m f_t(w_t) - \sum_{t=1}^m f_t(w^\star) \le \frac{1}{\lambda}\Bigl(\max_t \|\nabla f_t(w_t)\|_2^2\Bigr)\ln m \]


which remains valid also in the presence of the projection steps (11). Similarly to the analysis of
Pegasos, and using our assumptions on $\|x_t\|_\infty$ and $|y_t|$, the norm of the gradient $\nabla f_t(w_t)$ is bounded
as follows

\[ \|\nabla f_t(w_t)\|_2 \le \lambda\|w_t\|_2 + 4Bd\sqrt{\frac{2}{k}} . \]

In addition, it is easy to verify (e.g., using an inductive argument) that

\[ \|w_t\|_2 \le \frac{1}{\lambda}\,4Bd\sqrt{\frac{2}{k}} , \quad\text{which yields}\quad \|\nabla f_t(w_t)\|_2 \le 8Bd\sqrt{\frac{2}{k}} . \]

This gives the bound

\[ \sum_{t=1}^m 2(\hat{y}_t - y_t)\langle v_t, w_t - w^\star\rangle \le \frac{128\,(dB)^2}{\lambda k}\,\ln m + \frac{m\lambda}{2}\|w^\star\|_2^2 . \]


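Choosing $\lambda = 16d\sqrt{\log(m)/(km)}$ and noting that $\|\cdot\|_2 \le \|\cdot\|_1$ we get that

\[ \sum_{t=1}^m 2(\hat{y}_t - y_t)\langle v_t, w_t - w^\star\rangle \le 16\,dB^2\sqrt{\frac{m}{k}\ln m} . \]

The resulting bound is then

\[ E\Bigl[\sum_{t=1}^m (\langle w_t, x_t\rangle - y_t)^2\Bigr] \le \sum_{t=1}^m (\langle w^\star, x_t\rangle - y_t)^2 + 16\,dB^2\sqrt{\frac{m}{k}\ln m} . \]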


To conclude the proof, we apply the online-to-batch conversion of Cesa-Bianchi et al. (2004, Corollary 2)
to the probability space that includes both the algorithm’s own randomization and the product
distribution from which the training set is drawn. Since $(\langle w, x_t\rangle - y_t)^2 \le 4B^2$ for all w such
that $\|w\|_1 \le B$ (recall our assumptions on $x_t$ and $y_t$), and using the convexity of the square loss, we
obtain that

\[ L_D(\bar{w}) \le \inf_{w \,:\, \|w\|_1 \le B} L_D(w) + 16\,dB^2\sqrt{\frac{\ln m}{km}} + 4B^2\sqrt{\frac{2}{m}\ln\frac{1}{\delta}} \]

holds with probability at least 1 − δ with respect to all random events. □
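
The choice of λ in the proof balances the two terms of the regret bound. As a worked check, assuming the bound has the form $\frac{128(dB)^2}{\lambda k}\ln m + \frac{m\lambda}{2}\|w^\star\|_2^2$ and using $\|w^\star\|_2 \le B$, substituting $\lambda = 16d\sqrt{\ln(m)/(km)}$ makes both terms equal:

```latex
\frac{128\,(dB)^2}{\lambda k}\,\ln m
  = \frac{128\,d^2B^2 \ln m}{16\,d\,k}\sqrt{\frac{km}{\ln m}}
  = 8\,dB^2\sqrt{\frac{m}{k}\ln m},
\qquad
\frac{m\lambda}{2}\,B^2
  = 8\,dB^2\, m\sqrt{\frac{\ln m}{km}}
  = 8\,dB^2\sqrt{\frac{m}{k}\ln m}.
```

Their sum is $16\,dB^2\sqrt{(m/k)\ln m}$, and dividing by m gives the per-example term $16\,dB^2\sqrt{\ln(m)/(km)}$.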





Note that for small values of k (which is the reasonable regime here) the bound for AER is much 
better than the bound for Baseline: ignoring logarithmic factors, instead of quadratic dependence 
on d, we have only linear dependence on d. 

It is interesting to compare the bound for AER to the Lasso bound (2) for the full information
case. As can be seen, to achieve the same level of risk, AER needs a factor of d²/k more examples
than the full information Lasso.⁴ Since each AER example uses only k attributes while each Lasso
example uses all d attributes, the ratio between the total number of attributes AER needs and the
number of attributes Lasso needs to achieve the same error is O(d). Intuitively, when given d times
as many attributes in total, we can fully compensate for the partial information protocol.

However, in some situations even this extra d factor is not needed. Indeed, suppose we know
that the vector $w^\star$, which minimizes the risk, is dense. That is, it satisfies $\|w^\star\|_1 \approx \sqrt{d}\,\|w^\star\|_2 \le B$.
In this case, by setting $\lambda = d^{3/2}\sqrt{\log(m)/(km)}$, and using the tighter bound $\|w^\star\|_2 \le B/\sqrt{d}$ instead
of $\|w^\star\|_2 \le \|w^\star\|_1 \le B$ in the proof of Theorem 12, we get a final bound of the form

\[ L_D(\bar{w}) \le L_D(w^\star) + c\,B^2\sqrt{\frac{d}{km}\,\ln\frac{m}{\delta}} . \]





Therefore, the number of examples AER needs in order to achieve the same error as Lasso is only
a factor d/k more than the number of examples Lasso uses. But this implies that both AER and
Lasso need the same number of attributes in order to achieve the same level of error! Crucially, the
above holds only if $w^\star$ is dense. When $w^\star$ is sparse we have $\|w^\star\|_1 \approx \|w^\star\|_2$ and then AER needs
more attributes than Lasso.
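
The two regimes discussed above can be made concrete with a small calculation. The sketch below simply evaluates the overhead factors stated in the text (d²/k examples in general, d/k when w* is dense), ignoring constants and logarithmic factors; the function name `aer_overhead` and the MNIST-like values d = 784, k = 4 are illustrative assumptions.

```python
def aer_overhead(d, k, dense=False):
    """Overhead of AER relative to full-information Lasso, up to constants and logs."""
    # Extra factor in the number of examples, as stated in the text.
    example_factor = d / k if dense else d ** 2 / k
    # AER reads k attributes per example, Lasso reads all d.
    attribute_factor = example_factor * k / d
    return example_factor, attribute_factor

print(aer_overhead(784, 4))              # general case
print(aer_overhead(784, 4, dense=True))  # dense w*
```

In the general case the attribute overhead equals d, while in the dense case it equals 1, matching the claim that AER and Lasso then need the same number of attributes.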





4. We note that when d = k we still do not recover the full information bound. However, it is possible to improve the
analysis and replace the factor $d/\sqrt{k}$ with a factor $d\,(\max_t \|x_t\|_2)/k$.


Figure 3: In the upper row six examples from the training set (of digits 3 and 5) are shown. In the lower 
row we show the same six examples, where only four randomly sampled pixels from each original 
image are displayed. 






7. Experiments 


We performed some experiments to test the behavior of our algorithm on the well-known MNIST 
digit recognition data set (Le Cun et al., 1998), which contains 70,000 images (28 × 28 pixels each)
of the digits 0-9. The advantages of this data set for our purposes are that it is not a small scale
data set, has a reasonable dimensionality-to-data-size ratio, and the setting is clearly interpretable
graphically. While this data set is designed for classification (e.g., recognizing the digit in the
image), we can still apply our algorithms on it by regressing to the label.

First, to demonstrate the hardness of our settings, we provide in Figure 3 below some examples 
of images from the data set, in the full information setting and the partial information setting. The 
upper row contains six images from the data set, as available to a full information algorithm. A 
partial information algorithm, however, will have a much more limited access to these images. In 
particular, if the algorithm may only choose k = 4 pixels from each image, the same six images as 
available to it might look like the bottom row of Figure 3. 

We began by looking at a data set composed of “3” vs. “5”, where all the “3” digits were labeled 
as −1 and all the “5” digits were labeled as +1. We ran four different algorithms on this data set: the
simple Baseline algorithm, AER, as well as ridge regression and Lasso for comparison (for Lasso, 
we solved (1) with p = 1). Both ridge regression and Lasso were run in the full information setting: 
Namely, they enjoyed full access to all attributes of all examples in the training set. The Baseline 
algorithm and AER, however, were given access to only four attributes from each training example. 

We randomly split the data set into a training set and a test set (with the test set being 10% of the 
original data set). For each algorithm, parameter tuning was performed using 10-fold cross validation.
Then, we ran the algorithm on increasingly long prefixes of the training set, and measured the
average regression error $(\langle w, x\rangle - y)^2$ on the test set. The results (averaged over runs on 10 random
train-test splits) are presented in Figure 4. In the upper plot, we see how the test regression error 
improves with the number of examples. The Baseline algorithm is highly unstable at the beginning, 
probably due to the ill-conditioning of the estimated covariance matrix, although it eventually stabi- 
lizes (to prevent a graphical mess at the left hand side of the figure, we removed the error bars from 
the corresponding plot). Its performance is worse than AER, completely in line with our earlier 
theoretical analysis. 

The bottom plot of Figure 4 is similar, only that now the X-axis represents the cumulative
number of attributes seen by each algorithm rather than the number of examples. For the
partial-information algorithms, the graph ends at approximately 49,000 attributes, which is the total number
of attributes accessed by the algorithm after running over all training examples, seeing k = 4 pixels
from each example. However, for the full-information algorithms 49,000 attributes are already
seen after just 62 examples. When we compare the algorithms in this way, we see that our AER
algorithm achieves excellent performance for a given attribute budget, significantly better than the
other 1-norm-based algorithms (Baseline and Lasso). Moreover, AER is even comparable to the
full information 2-norm-based ridge regression algorithm, which performs best on this data set.
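
The attribute accounting behind this comparison is simple arithmetic. The sketch below reproduces it under the figures quoted in the text (28 × 28 images, k = 4, a budget of roughly 49,000 attributes); the variable names are illustrative.

```python
d = 28 * 28        # attributes (pixels) per MNIST image
k = 4              # attributes a partial-information algorithm sees per example
budget = 49_000    # approximate total number of attributes, as quoted in the text

partial_examples = budget // k   # examples processed when reading k pixels each
full_examples = budget // d      # examples processed when reading all d pixels

print(partial_examples, full_examples)
```

With these numbers a partial-information algorithm touches about 12,250 examples, whereas a full-information algorithm exhausts the same budget after 62 examples.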

Figure 4: Test regression error for each one of the four algorithms (ridge regression, Lasso, AER, and
Baseline), over increasing prefixes of the training set for “3” vs. “5”. The results are averaged over 10
runs.

Finally, we tested the algorithms over 45 data sets generated from MNIST, one for each possible
pair of digits. For each data set and each of 10 random train-test splits, we performed parameter
tuning for each algorithm separately, and checked the average squared error on the test set. The
median test errors over all data sets are presented in the table below.

                                    Test Error
Full Information       Ridge        0.110
                       Lasso        0.222
Partial Information    AER          0.320
                       Baseline     0.812























As can be seen, the AER algorithm manages to achieve good performance, not much worse
than the full information Lasso algorithm. The Baseline algorithm, however, achieves substantially
worse performance, in line with our theoretical analysis above. We also calculated the test
classification error of AER, that is, the fraction of test examples with $\mathrm{sign}(\langle w, x\rangle) \ne y$, and found that AER, which can see only
4 pixels per image, usually performs only a little worse than the full information algorithms (ridge
regression and Lasso), which enjoy full access to all 784 pixels in each image. In particular, the
median test classification errors of AER, Lasso, and Ridge are 3.5%, 1.1%, and 1.3% respectively.
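
The two test metrics used in this section can be computed as in the following sketch. It assumes a learned weight vector `w`, a test matrix `X` with one example per row, and labels `y` in {−1, +1}; the function names are hypothetical and NumPy is an assumed dependency.

```python
import numpy as np

def regression_error(w, X, y):
    """Average squared error (<w, x> - y)^2 over the rows of X."""
    return float(np.mean((X @ w - y) ** 2))

def classification_error(w, X, y):
    """Fraction of rows where sign(<w, x>) differs from the label y."""
    return float(np.mean(np.sign(X @ w) != y))
```

Note that a prediction of exactly zero has sign 0 and is therefore counted as a classification error under this convention.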


8. Discussion and Extensions 


In this paper we have investigated three budgeted learning settings with different constraints on the 
way instance attributes may be accessed: a local constraint on each training example (local budget), 
a global constraint on the set of all training examples (global budget), and a constraint on each test 
example (prediction on a budget). In the local budget setting, we have introduced a simple and 
efficient algorithm, AER, that learns by accessing a pre-specified number of attributes from each 
training example. The AER algorithm comes with formal guarantees, is provably competitive with 
algorithms which enjoy full access to the data, and performs well in simple experiments. This result 
is complemented by a general lower bound for the global budget setting which is a factor d smaller 
than the upper bound achieved by our algorithm. We note that this gap has recently been closed
by Hazan and Koren (2011), who, in our local budget setting, present 1-norm and 2-norm-based
algorithms for learning linear predictors using only O(d) attributes, thus matching our lower bound
to within logarithmic factors.

Whereas AER is based on Pegasos, our adaptive sampling approach easily extends to other
gradient-based algorithms. Examples include generalized additive algorithms such as p-norm
Perceptrons and Winnow (see, for example, Cesa-Bianchi and Lugosi, 2006).

In contrast to the local/global budget settings, where we can learn efficiently by accessing few 
attributes of each training example, we showed that accessing a limited number of attributes at test 
time is a significantly harder setting. Indeed, we proved that it is not possible to build an active linear
predictor that uses two attributes of each test example and whose error is smaller than a certain 
constant, even when there exists a linear predictor achieving zero error on the same data source. 

An obvious direction for future research is how to deal with loss functions other than the squared 
loss. In related work (Cesa-Bianchi et al., 2010), we developed a technique which allows us to 
deal with arbitrary analytic loss functions. However, in the setting of this paper, those techniques 
would lead to sample complexity bounds which are exponential in d. Another interesting extension 
we are considering is connecting our results to the field of privacy-preserving learning (Dwork,
2008), where the goal is to exploit the attribute efficiency property in order to prevent acquisition of
information about individual data instances.

Appendix A. Proof of Theorem 1


The outline of the proof is as follows. We define a specific distribution such that only one “good” 
feature is slightly correlated with the label. We then show that if some algorithm learns a linear 
predictor with an extra risk of at most ε, then it must know the value of the good feature. Next, we
construct a variant of a multi-armed bandit problem out of our distribution and show that a good
learner can yield a good prediction strategy. Finally, we adapt a lower bound for the multi-armed
bandit problem given in Auer et al. (2003), to conclude that the number k of attributes viewed by a
good learner must satisfy $k = \Omega(d/\epsilon)$.


A.1 The Distribution 


We generate a joint distribution over $R^d \times R$ as follows. Choose some j ∈ [d]. First, we generate
$y_1, y_2, \ldots \in \{\pm 1\}$ i.i.d. according to $P[y_t = 1] = P[y_t = -1] = \frac{1}{2}$. Given j and $y_t$, $x_t \in \{\pm 1\}^d$ is
generated according to $P[x_{t,i} = y_t] = \frac{1}{2} + 1\{i = j\}\,p$ where p > 0 is chosen later. Denote by $P_j$
the distribution mentioned above assuming the “good” feature is j. Also denote by $P_u$ the uniform
distribution over $\{\pm 1\}^{d+1}$. Analogously, we denote by $E_j$ and $E_u$ expectations w.r.t. $P_j$ and $P_u$.
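
To make the construction concrete, the following sketch draws n labeled examples from the distribution with good feature j. It assumes the coordinates of x are conditionally independent given the label, which is how the construction reads; the function name `sample_pj` and the use of NumPy are illustrative choices.

```python
import numpy as np

def sample_pj(j, d, p, n, rng):
    """Draw n pairs (x, y) with y uniform on {-1, +1} and
    P[x_i = y] = 1/2 + p for i == j, and 1/2 for every other coordinate."""
    y = rng.choice([-1, 1], size=n)
    agree_prob = np.full(d, 0.5)
    agree_prob[j] += p
    # agree[s, i] is True when coordinate i of example s copies the label.
    agree = rng.random((n, d)) < agree_prob
    X = np.where(agree, y[:, None], -y[:, None])
    return X, y
```

Empirically, the agreement rate `(X == y[:, None]).mean(axis=0)` should be close to 1/2 on every coordinate except j, where it should be close to 1/2 + p.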























A.2 A Good Regressor “Knows” j 


We now show that if we have a good linear regressor then we can know the value of j. It is easy to
see that the optimal linear predictor under the distribution $P_j$ is $w^\star = 2p\,e^j$, and the risk of $w^\star$ is

\[ L_{P_j}(w^\star) = E_j\bigl[(\langle w^\star, x\rangle - y)^2\bigr] = \bigl(\tfrac{1}{2} + p\bigr)(1 - 2p)^2 + \bigl(\tfrac{1}{2} - p\bigr)(1 + 2p)^2 = 1 + 4p^2 - 8p^2 = 1 - 4p^2 . \]

The risk of an arbitrary weight vector w under $P_j$ is

\[ L_{P_j}(w) = E_j\bigl[(\langle w, x\rangle - y)^2\bigr] = \sum_{i \ne j} w_i^2 + E_j\bigl[(w_j x_j - y)^2\bigr] = \sum_{i \ne j} w_i^2 + w_j^2 + 1 - 4p\,w_j . \]

Suppose that $L_{P_j}(w) - L_{P_j}(w^\star) < \epsilon$. This implies that:

1. For all $i \ne j$ we have $w_i^2 < \epsilon$, or equivalently, $|w_i| < \sqrt{\epsilon}$.

2. $1 + w_j^2 - 4p\,w_j - (1 - 4p^2) < \epsilon$ and thus $|w_j - 2p| < \sqrt{\epsilon}$, which gives $|w_j| > 2p - \sqrt{\epsilon}$.

By choosing $p = \sqrt{\epsilon}$, the above implies that we can identify the value of j from any w whose risk
is strictly smaller than $L_{P_j}(w^\star) + \epsilon$.
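
The identification step can be illustrated numerically. The sketch below evaluates the closed-form risk, namely the sum of squared weights plus 1 − 4p·w_j, and recovers j as the coordinate of largest magnitude, which is what the two conditions above imply when p = √ε. The function names and the NumPy dependency are assumptions made for illustration.

```python
import numpy as np

def risk(w, j, p):
    """Closed-form risk of w when the good feature is j: ||w||_2^2 + 1 - 4 p w_j."""
    w = np.asarray(w, dtype=float)
    return float(np.sum(w ** 2) + 1.0 - 4.0 * p * w[j])

def identify_good_feature(w):
    """With p = sqrt(eps), an eps-good w has |w_j| > sqrt(eps) > |w_i| for i != j."""
    return int(np.argmax(np.abs(w)))

eps = 0.01
p = np.sqrt(eps)
d, j = 6, 3
w_star = np.zeros(d)
w_star[j] = 2 * p                      # optimal predictor 2p e^j
w = w_star + 0.03                      # perturbation with squared norm 0.0054 < eps
print(risk(w, j, p) - risk(w_star, j, p), identify_good_feature(w))
```

The excess risk of a perturbed predictor equals the squared Euclidean norm of the perturbation, so any w within excess risk ε keeps coordinate j as the unique largest entry in absolute value.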

A.3 Constructing A Variant Of A Multi-Armed Bandit Problem


We now construct a variant of the multi-armed bandit problem out of the distribution $P_j$. Each
coordinate $i \in \{1,\ldots,d\}$ is an arm and the reward of pulling i at time t is $1\{x_{N_{i,t},i} = y_{N_{i,t}}\} \in \{0,1\}$,
where $N_{i,t}$ denotes the random number of times arm i has been pulled in the first t plays. Hence the
expected reward of pulling i is $\frac{1}{2} + 1\{i = j\}\,p$. At the end of each round t the player observes $x_{N_{i,t},i}$
and $y_{N_{i,t}}$.
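
A small simulation may help to see how pulls map to attribute queries. In the sketch below, the s-th pull of arm i reveals coordinate i of the s-th example together with its label, so different arms pulled the same number of times read the same example. The class name `AttributeBandit` and its interface are hypothetical, and examples are generated lazily only as a convenience.

```python
import numpy as np

class AttributeBandit:
    """Bandit built from the distribution with good feature j: arm i pays 1 when x_i equals y."""

    def __init__(self, j, d, p, rng):
        self.j, self.d, self.p, self.rng = j, d, p, rng
        self.pulls = np.zeros(d, dtype=int)   # number of pulls of each arm so far
        self.examples = []                    # lazily generated pairs (x_s, y_s)

    def pull(self, i):
        self.pulls[i] += 1
        s = self.pulls[i]
        while len(self.examples) < s:         # create example s on first use
            y = self.rng.choice([-1, 1])
            agree_prob = np.full(self.d, 0.5)
            agree_prob[self.j] += self.p
            x = np.where(self.rng.random(self.d) < agree_prob, y, -y)
            self.examples.append((x, y))
        x, y = self.examples[s - 1]
        return int(x[i] == y), x[i], y        # reward, observed attribute, label
```

Averaging the rewards of repeated pulls should give roughly 1/2 + p for arm j and roughly 1/2 for any other arm.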


A.4 A Good Learner Yields A Bandit Strategy 


Suppose that we have a learner that, for any j = 1,...,d, can learn a linear predictor with
$L_{P_j}(w) - L_{P_j}(w^\star) < \epsilon$ using k attributes. Since we have shown that once $L_{P_j}(w) - L_{P_j}(w^\star) < \epsilon$ we know the
value of j, we can construct a strategy for the multi-armed bandit problem in a straightforward way.
Simply use the first m examples to learn w and from then on always pull the arm j. The expected
reward of this strategy under any $P_j$ after $T \ge k$ plays is at least

\[ \frac{k}{2} + (T - k)\Bigl(\frac{1}{2} + p\Bigr) = \frac{T}{2} + (T - k)\,p . \tag{13} \]


A.5 An Upper Bound On the Reward Of Any Bandit Strategy 


Recall that under distribution $P_j$ the expected reward for pulling arm I is $\frac{1}{2} + p\,1\{I = j\}$. Hence,
the total expected reward of a player that runs for T rounds is upper bounded by $\frac{T}{2} + p\,E_j[N_j]$,
where $N_j = N_{j,T}$ is the overall number of pulls of arm j. Moreover, at the end of each round t the
player observes $x_{s,i}$ and $y_s$, where $s = N_{i,t}$. This allows the player to compute the value of the reward
for the current play. For any s, note that $y_s$ is observed whenever some arm i is pulled for the s-th
time. However, since $P_j[x_{s,i} = y_s] = P_j[x_{s,i} = y_s \mid y_s]$ for all i (including i = j), the knowledge of
$y_s$ does not provide any information about the distribution of rewards for arm i. Therefore, without
loss of generality, we can assume that at each play the bandit strategy observes only the obtained
binary reward. This implies that our bandit construction is identical to the one used in the proof of
Theorem 5.1 in Auer et al. (2003). In particular, for any bandit strategy there exists some arm j such
that the expected reward of the strategy under distribution $P_j$ is at most

\[ \frac{T}{2} + p\Bigl(\frac{T}{d} + \frac{T}{2}\sqrt{-\frac{T}{d}\ln(1 - 4p^2)}\Bigr) \le \frac{T}{2} + p\Bigl(\frac{T}{d} + \frac{T}{2}\sqrt{\frac{6T}{d}\,p^2}\Bigr) \tag{14} \]

where we used the inequality $-\ln(1 - q) \le \frac{3}{2}q$ for $q \in [0, 1/4]$. Note that $q = 4p^2 = 4\epsilon \in [0, 1/4]$
when $\epsilon \le 1/16$.
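
The logarithmic inequality used here is easy to check numerically. The sketch below verifies on a grid that −ln(1 − q) ≤ (3/2)q for q in [0, 1/4]; the constant 3/2 is an assumption about the intended statement, chosen because it is the constant that produces a factor 6p² when q = 4p², and it implies the weaker version with constant 3. The function name is illustrative.

```python
import numpy as np

def log_bound_gap(num_points=1001):
    """Return 1.5 * q - (-ln(1 - q)) on a uniform grid over [0, 1/4]."""
    q = np.linspace(0.0, 0.25, num_points)
    return 1.5 * q + np.log1p(-q)

gap = log_bound_gap()
print(bool(gap.min() >= 0.0))
```

Since −ln(1 − q) is convex and vanishes at q = 0, checking the endpoint q = 1/4 already suffices; the grid is only a sanity check.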


A.6 Concluding The Proof 


Take a learning algorithm that finds an ε-good predictor using k attributes. Since the reward of the
strategy based on this learning algorithm cannot exceed the upper bound given in (14), from (13)
we obtain that

\[ \frac{T}{2} + (T - k)\,p \le \frac{T}{2} + p\Bigl(\frac{T}{d} + \frac{T}{2}\sqrt{\frac{6T}{d}\,p^2}\Bigr) \]

which solved for k gives

\[ k \ge T\Bigl(1 - \frac{1}{d} - \frac{1}{2}\sqrt{\frac{6T}{d}\,p^2}\Bigr) . \]

Since we assume $d \ge 4$, choosing $T = \lfloor d/(96p^2) \rfloor$, and recalling $p^2 = \epsilon$, gives

\[ k \ge \frac{T}{2} = \frac{1}{2}\Bigl\lfloor \frac{d}{96\,\epsilon} \Bigr\rfloor . \]
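
The final arithmetic can be checked for concrete values. The sketch below assumes the bound on k has the form T(1 − 1/d − ½·sqrt(6Tp²/d)) with T = ⌊d/(96p²)⌋ and p² = ε, and confirms that the bracket is at least 1/2 when d ≥ 4; the function name `attribute_lower_bound` is hypothetical.

```python
import math

def attribute_lower_bound(d, eps):
    """Return (T, k_min) where k_min = T * (1 - 1/d - 0.5 * sqrt(6 * T * eps / d))."""
    T = math.floor(d / (96 * eps))
    k_min = T * (1 - 1 / d - 0.5 * math.sqrt(6 * T * eps / d))
    return T, k_min

for d in (4, 10, 100, 784):
    T, k_min = attribute_lower_bound(d, eps=0.01)
    print(d, T, k_min >= T / 2)
```

Because T ≤ d/(96ε), the square root is at most 1/4, so the bracket is at least 1 − 1/4 − 1/8 = 5/8 for d ≥ 4, which gives k ≥ T/2 and hence a lower bound of order d/ε.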